Option Génomique Bioinformatique :¶

Session 1 "Axe 1 : Visualiation des gènes différentiellement exprimés et analyses d'enrichissement"¶

Claire Vandiedonck & Sandrine Caburet


Tutorial

In this tutorial, we will further analyse the Differential Expression between T1D patients and their related unaffected controls using microarrays transcriptome data ("Fil Rouge" UE5 and Practical Sessions in UE1).

We will first load the data and metadata.

We will also use two powerful packages:

  • ComplexHeatMap to generate enhanced heatmaps: with sample annotations on the sides, with hierarchical clustering dendograms generated with several distance metrics and clusterization aggregration methods

  • clusteProfiler to perform further enrichment analyses on sets of genes of interest

We will finally prepare input files for GSEA: Gene Set Enrichment Analysis with the GSEA Java stand-alone application.

Avant d'aller plus loin

Attention: Ne travaillez pas directement sur ce notebook pour ne pas le perdre. Dupliquez-le et renommez-le par exemple en ajoutant vos initiales et travaillez sur cette nouvelle copie. Pour ce faire, dans le panneau de gauche, faites un clic droit sur le fichier et sélectionnez "Duplicate". Puis, toujours dans la colonne de gauche, faites un clic droit sur cette copie et sélectionnez "rename" pour changer le nom. Ouvrez ensuite cette nouvelle version en double cliquant dessus. Vous êtes prêt(e) à démarrer!

N'oubliez pas de sauvegarder régulièrement votre notebook: Ctrl + S. ou en cliquant sur l'icone 💾 en haut à gauche de votre notebook ou dans le Menu du JupyterLab "File puis "Save Notebook"! Vous pouvez aussi le sauvegarder au format html: Menu "File" > Export Notebook As> Export notebook as HTML.

0. Set up parameters and RSession¶


We start with setting up paths to directories.

In [1]:
# Code cell n°1

#setwd("~/meg_m1_gb_r")
myworking_path <- "./" # modify as necessary
datapath <- "/srv/data/meg-m1-gb/DGE_results/"

We install required libraries and load them.

In [2]:
# Code cell n°2

# list the required libraries
requiredLib <- c(
    "statmod",
    "RColorBrewer",
    "dendextend",
    "stats",
    "grDevices",
    "BiocManager",
    "writexl",
    "ggnewscale",
    "ggupset",
    "writexl",
    "readxl",
    "ggridges",
    "gplots")
requiredBiocLib <- c("limma", 
                     "org.Hs.eg.db",
                     "clusterProfiler",
                     "enrichplot",
                     "ComplexHeatmap",
                     "ReactomePA")

# install required libraries if not
for (lib in requiredLib) {
  if (!require(lib, character.only = TRUE, quiet = TRUE)) {
    install.packages(lib, quiet = TRUE)
  }
}

for( lib in requiredBiocLib) {
  if (!require(lib, character.only = TRUE, quiet = TRUE)) {
  BiocManager::install(lib, quiet = TRUE)
  }
}

# load libraries
message("Loading required libraries")
for (lib in requiredLib) {
  library(lib, character.only = TRUE)}

for (lib in requiredBiocLib) {
  library(lib, character.only = TRUE)}

# kepp tracks of R and packages for this study
sessionInfo()

rm(lib, requiredLib, requiredBiocLib)
---------------------
Welcome to dendextend version 1.16.0
Type citation('dendextend') for how to cite the package.

Type browseVignettes(package = 'dendextend') for the package vignette.
The github page is: https://github.com/talgalili/dendextend/

Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
You may ask questions at stackoverflow, use the r and dendextend tags: 
	 https://stackoverflow.com/questions/tagged/dendextend

	To suppress this message use:  suppressPackageStartupMessages(library(dendextend))
---------------------



Attaching package: ‘dendextend’


The following object is masked from ‘package:stats’:

    cutree


Bioconductor version '3.14' is out-of-date; the current release version '3.16'
  is available with R version '4.2'; see https://bioconductor.org/install


Attaching package: ‘gplots’


The following object is masked from ‘package:stats’:

    lowess



Attaching package: ‘BiocGenerics’


The following object is masked from ‘package:limma’:

    plotMA


The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs


The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, basename, cbind, colnames,
    dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
    grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
    order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
    union, unique, unsplit, which.max, which.min


Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.



Attaching package: ‘S4Vectors’


The following object is masked from ‘package:gplots’:

    space


The following objects are masked from ‘package:base’:

    expand.grid, I, unname






clusterProfiler v4.2.2  For help: https://yulab-smu.top/biomedical-knowledge-mining-book/

If you use clusterProfiler in published research, please cite:
T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. The Innovation. 2021, 2(3):100141


Attaching package: ‘clusterProfiler’


The following object is masked from ‘package:AnnotationDbi’:

    select


The following object is masked from ‘package:IRanges’:

    slice


The following object is masked from ‘package:S4Vectors’:

    rename


The following object is masked from ‘package:stats’:

    filter


========================================
ComplexHeatmap version 2.10.0
Bioconductor page: http://bioconductor.org/packages/ComplexHeatmap/
Github page: https://github.com/jokergoo/ComplexHeatmap
Documentation: http://jokergoo.github.io/ComplexHeatmap-reference

If you use it in published research, please cite:
Gu, Z. Complex heatmaps reveal patterns and correlations in multidimensional 
  genomic data. Bioinformatics 2016.

The new InteractiveComplexHeatmap package can directly export static 
complex heatmaps into an interactive Shiny app with zero effort. Have a try!

This message can be suppressed by:
  suppressPackageStartupMessages(library(ComplexHeatmap))
========================================


ReactomePA v1.38.0  For help: https://yulab-smu.top/biomedical-knowledge-mining-book/

If you use ReactomePA in published research, please cite:
Guangchuang Yu, Qing-Yu He. ReactomePA: an R/Bioconductor package for reactome pathway analysis and visualization. Molecular BioSystems 2016, 12(2):477-479

Loading required libraries

R version 4.1.3 (2022-03-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 18.04.4 LTS

Matrix products: default
BLAS/LAPACK: /srv/conda/envs/notebook/lib/libopenblasp-r0.3.21.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] grid      stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] ReactomePA_1.38.0     ComplexHeatmap_2.10.0 enrichplot_1.14.2    
 [4] clusterProfiler_4.2.2 org.Hs.eg.db_3.14.0   AnnotationDbi_1.56.2 
 [7] IRanges_2.28.0        S4Vectors_0.32.4      Biobase_2.54.0       
[10] BiocGenerics_0.40.0   limma_3.50.3          gplots_3.1.3         
[13] ggridges_0.5.4        readxl_1.4.1          ggupset_0.3.0        
[16] ggnewscale_0.4.8      writexl_1.4.1         BiocManager_1.30.19  
[19] dendextend_1.16.0     RColorBrewer_1.1-3    statmod_1.4.37       

loaded via a namespace (and not attached):
  [1] backports_1.4.1        uuid_1.1-0             shadowtext_0.1.2      
  [4] circlize_0.4.15        fastmatch_1.1-3        plyr_1.8.8            
  [7] igraph_1.3.5           repr_1.1.4             lazyeval_0.2.2        
 [10] splines_4.1.3          BiocParallel_1.28.3    GenomeInfoDb_1.30.1   
 [13] ggplot2_3.4.0          digest_0.6.30          foreach_1.5.2         
 [16] yulab.utils_0.0.5      htmltools_0.5.3        GOSemSim_2.20.0       
 [19] viridis_0.6.2          GO.db_3.14.0           fansi_1.0.3           
 [22] checkmate_2.1.0        magrittr_2.0.3         memoise_2.0.1         
 [25] cluster_2.1.4          doParallel_1.0.17      Biostrings_2.62.0     
 [28] graphlayouts_0.8.3     matrixStats_0.62.0     colorspace_2.0-3      
 [31] rappdirs_0.3.3         blob_1.2.3             ggrepel_0.9.2         
 [34] dplyr_1.0.10           crayon_1.5.2           RCurl_1.98-1.9        
 [37] jsonlite_1.8.3         graph_1.72.0           scatterpie_0.1.8      
 [40] iterators_1.0.14       ape_5.6-2              glue_1.6.2            
 [43] polyclip_1.10-4        gtable_0.3.1           zlibbioc_1.40.0       
 [46] XVector_0.34.0         GetoptLong_1.0.5       graphite_1.40.0       
 [49] shape_1.4.6            scales_1.2.1           DOSE_3.20.1           
 [52] DBI_1.1.3              Rcpp_1.0.9             viridisLite_0.4.1     
 [55] clue_0.3-62            gridGraphics_0.5-1     tidytree_0.4.1        
 [58] reactome.db_1.77.0     bit_4.0.4              httr_1.4.4            
 [61] fgsea_1.20.0           pkgconfig_2.0.3        farver_2.1.1          
 [64] utf8_1.2.2             ggplotify_0.1.0        tidyselect_1.2.0      
 [67] rlang_1.0.6            reshape2_1.4.4         munsell_0.5.0         
 [70] cellranger_1.1.0       tools_4.1.3            cachem_1.0.6          
 [73] downloader_0.4         cli_3.4.1              generics_0.1.3        
 [76] RSQLite_2.2.18         evaluate_0.18          stringr_1.4.1         
 [79] fastmap_1.1.0          ggtree_3.7.1.001       bit64_4.0.5           
 [82] tidygraph_1.2.2        caTools_1.18.2         purrr_0.3.5           
 [85] KEGGREST_1.34.0        ggraph_2.1.0           nlme_3.1-160          
 [88] aplot_0.1.8            DO.db_2.9              compiler_4.1.3        
 [91] png_0.1-7              treeio_1.23.0          tibble_3.1.8          
 [94] tweenr_2.0.2           stringi_1.7.8          lattice_0.20-45       
 [97] IRdisplay_1.1          Matrix_1.5-3           vctrs_0.5.0           
[100] pillar_1.8.1           lifecycle_1.0.3        GlobalOptions_0.1.2   
[103] data.table_1.14.4      bitops_1.0-7           patchwork_1.1.2       
[106] qvalue_2.26.0          R6_2.5.1               KernSmooth_2.23-20    
[109] gridExtra_2.3          codetools_0.2-18       MASS_7.3-58.1         
[112] gtools_3.9.3           assertthat_0.2.1       rjson_0.2.21          
[115] withr_2.5.0            GenomeInfoDbData_1.2.7 parallel_4.1.3        
[118] ggfun_0.0.8            IRkernel_1.2           tidyr_1.2.1           
[121] pbdZMQ_0.3-8           ggforce_0.4.1          base64enc_0.1-3       

1. Load Data¶


At the end of the Fil Rouge on microarrays, we ended up with several R objects that we will reload in this R Session:

1.A. Metadata¶

  • probes
In [3]:
# Code cell n°3

load(paste(datapath,"probes.RData",sep = ""))
ls()
str(probes)
gene_list <- probes$TargetID
cat("There are ", length(unique(gene_list)), "unique genes\n")
  1. 'datapath'
  2. 'myworking_path'
  3. 'probes'
'data.frame':	47323 obs. of  9 variables:
 $ ProbeID              : int  6450255 2570615 6370619 2600039 2650615 5340672 2000519 3870044 7050209 1580181 ...
 $ CHROMOSOME           : chr  "7" "19" "19" "10" ...
 $ CYTOBAND             : chr  "7p15.3e" "19q13.43c" "19q13.43c" "10q11.23c" ...
 $ PROBE_CHR_ORIENTATION: chr  "-" "-" "-" "-" ...
 $ PROBE_COORDINATES    : chr  "20147187-20147236" "63548541-63548590" "63549180-63549229" "52566586-52566635" ...
 $ REFSEQ_ID            : chr  "NM_182762.2" "NM_130786.2" "NM_130786.2" "NM_138932.1" ...
 $ ENTREZ_GENE_ID       : int  346389 1 1 29974 29974 29974 23784 23784 23784 54715 ...
 $ TargetID             : chr  "7A5" "A1BG" "A1BG" "A1CF" ...
 $ SYMBOL               : chr  "7A5" "A1BG" "A1BG" "A1CF" ...
There are  34694 unique genes
  • samples information (ID, pedigree ID, sex, age, status, stimulation level...)
In [4]:
# Code cell n°4

load(paste(datapath,"samples_info.RData", sep = ""))
ls()
str(samples_info)
  1. 'datapath'
  2. 'gene_list'
  3. 'myworking_path'
  4. 'probes'
  5. 'samples_info'
'data.frame':	264 obs. of  8 variables:
 $ array.labels: chr  "5753669129_B" "5753669129_C" "5753669129_E" "5753669129_K" ...
 $ PedID       : chr  "1" "1" "1" "1" ...
 $ ID          : chr  "1" "1" "1" "1" ...
 $ Status      : chr  "2" "2" "2" "2" ...
 $ Stim        : chr  "0" "0" "6" "6" ...
 $ Full        : chr  "1_1_2_0" "1_1_2_0" "1_1_2_6" "1_1_2_6" ...
 $ Sex         : chr  "1" "1" "1" "1" ...
 $ Age         : int  10 10 10 10 10 10 18 18 18 18 ...

1.B. Normalized expression data¶

We load the norm.quant object obtained after log2 transformation and between-samples quantile normalization.

In [5]:
# Code cell n°5

load(paste(datapath, "norm.quant.RData", sep = ""))
ls()
str(norm.quant)
norm.quant[1:10, 1:10]
  1. 'datapath'
  2. 'gene_list'
  3. 'myworking_path'
  4. 'norm.quant'
  5. 'probes'
  6. 'samples_info'
 num [1:47323, 1:264] 7.35 7.09 6.3 6.57 6.49 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:47323] "6450255" "2570615" "6370619" "2600039" ...
  ..$ : chr [1:264] "5753669129_B" "5753669129_C" "5753669129_E" "5753669129_K" ...
A matrix: 10 × 10 of type dbl
5753669129_B5753669129_C5753669129_E5753669129_K5753669129_A5753669129_F5753669095_B5753669095_F5753669095_G5753669095_K
64502557.3515497.0560637.0529477.3980737.2237276.8667527.0192866.9869847.2897157.082987
25706157.0877706.9090356.8745726.7942546.6353616.7668376.6507536.8411886.8168506.684924
63706196.2995246.7542026.5506786.5742706.6250256.4736276.4019156.3186596.4619616.427141
26000396.5725196.6092026.5689406.3918016.4359656.7827946.4686966.5373516.4944136.615691
26506156.4923146.4949686.7738846.5077316.9353576.4287736.6068656.6180286.4345436.448600
53406726.1940346.1386956.0691356.1927786.2474626.2355625.9092826.0656576.0562666.199995
20005196.7172926.4533346.8833986.9585286.9136436.3579446.6495136.7746906.4858546.731937
38700445.8942375.7530325.9819515.9918175.7441945.7281025.8420475.6992045.6722245.871414
70502096.7336646.6997756.5693746.5278736.8509506.7694876.5691426.5396636.7867686.549867
15801816.7402986.4026606.3299716.4021466.5361896.6281876.6242506.4162596.4154316.628095

Each row corresponds to an Illumina probeID (probe Sets), while each column name corresponds to an Illumina array sample. The correspondance between probeID and gene names is provided in the probes object, while the experimental design matrix of samples is provided in the samples_info object.

1.C. Results of the DGE analysis¶

  • At the end of the DGE analysis, we stored all results in a list limma.full.outs with 6 dataframes:

      1. with the result of the full model analysis: testing whether gene expression varies in at least one condition
    
      2. with the result of the first contrast for the DGE analysis
    
      3. etc...

We saved this list in an excel document DGE_T1D_microarrays.xlsx, each tab corresponding to one list element (i.e. one tab corresponding to one analysis).

  • Here we upload the data and put them back in a list R object, using a Read Excel function.
In [6]:
# Code cell n°6

limma.full.outs <- list()

sheet_names <- readxl::excel_sheets(paste(datapath, "DGE_T1D_microarrays.xlsx", sep = ""))

for (sheet in sheet_names) {
limma.full.outs[[sheet]] <- as.data.frame(read_xlsx(paste(datapath, "DGE_T1D_microarrays.xlsx", sep = ""), sheet = sheet))
    }

This list contains the following 6 dataframes:

In [7]:
# Code cell n°7

print(names(limma.full.outs))
[1] "limma.fullmodel.out" "limma.Pat.H0.out"    "limma.Pat.H24.out"  
[4] "limma.Pat.H6.out"    "limma.Pat.H624.out"  "limma.Pat.out"      
  • We can check the structure of the full model:
In [8]:
# Code cell n°8

str(limma.full.outs[[1]])
'data.frame':	47323 obs. of  18 variables:
 $ ProbeID              : chr  "3120474" "3310376" "630725" "1430709" ...
 $ Pat.H0               : num  -0.58 0.327 -0.569 -0.285 0.328 ...
 $ Pat.H6               : num  0.1486 0.0132 -0.2268 -0.0714 0.0257 ...
 $ Pat.H24              : num  0.00847 -0.07835 -0.1993 -0.35116 0.16611 ...
 $ Pat.H6H24            : num  0.1571 -0.0652 -0.4261 -0.4226 0.1918 ...
 $ Patient              : num  -1.583 0.915 -2.133 -1.277 1.175 ...
 $ AveExpr              : num  7.81 11.45 7.92 6.97 7.27 ...
 $ F                    : num  21.6 17 14.2 14 13.9 ...
 $ P.Value              : num  1.55e-12 3.86e-10 1.29e-08 1.64e-08 1.83e-08 ...
 $ adj.P.Val            : num  7.34e-08 9.14e-06 1.60e-04 1.60e-04 1.60e-04 ...
 $ CHROMOSOME           : chr  "22" "11" "12" "17" ...
 $ CYTOBAND             : chr  "22q11.23c" "11q12.1a" "12q15a" "17q11.2e" ...
 $ PROBE_CHR_ORIENTATION: chr  "+" "-" "-" "+" ...
 $ PROBE_COORDINATES    : chr  "23923092-23923141" "57296016-57296065" "68548736-68548785" "28348895-28348909:28348910-28348944" ...
 $ REFSEQ_ID            : chr  "XM_371461.4" "NM_012456.2" "NM_000619.2" "NM_173847.3" ...
 $ ENTREZ_GENE_ID       : num  85379 26519 3458 124912 638 ...
 $ TargetID             : chr  "KIAA1671" "TIMM10" "IFNG" "SPACA3" ...
 $ SYMBOL               : chr  "KIAA1671" "TIMM10" "IFNG" "SPACA3" ...

We have the log2 mean expression per condition, followed by the average expression in all conditions, then the statistic value F and the corresponding P.Val and adj.P.Val after Benjamini-Hochberg correction. Then you have probes annotations.

  • Finally, we can check the structure of any contrast results, like below, or the contrast between Patients and Controls, whatever the stimulation:
In [9]:
# Code cell n°9

str(limma.full.outs[[6]])
'data.frame':	47323 obs. of  17 variables:
 $ ProbeID              : chr  "3120474" "3310376" "630725" "6280440" ...
 $ logFC                : num  -1.583 0.915 -2.133 1.175 0.665 ...
 $ CI.L                 : num  -1.979 0.657 -2.797 0.799 0.449 ...
 $ CI.R                 : num  -1.186 1.174 -1.47 1.551 0.882 ...
 $ AveExpr              : num  7.81 11.45 7.92 7.27 7.86 ...
 $ t                    : num  -7.86 6.97 -6.33 6.15 6.04 ...
 $ P.Value              : num  9.51e-14 2.48e-11 1.02e-09 2.86e-09 5.02e-09 ...
 $ adj.P.Val            : num  4.50e-09 5.86e-07 1.62e-05 3.39e-05 4.75e-05 ...
 $ B                    : num  20 14.9 11.5 10.6 10.1 ...
 $ CHROMOSOME           : chr  "22" "11" "12" "22" ...
 $ CYTOBAND             : chr  "22q11.23c" "11q12.1a" "12q15a" "22q13.2c" ...
 $ PROBE_CHR_ORIENTATION: chr  "+" "-" "-" "+" ...
 $ PROBE_COORDINATES    : chr  "23923092-23923141" "57296016-57296065" "68548736-68548785" "41855291-41855340" ...
 $ REFSEQ_ID            : chr  "XM_371461.4" "NM_012456.2" "NM_000619.2" "NM_001197.3" ...
 $ ENTREZ_GENE_ID       : num  85379 26519 3458 638 282969 ...
 $ TargetID             : chr  "KIAA1671" "TIMM10" "IFNG" "BIK" ...
 $ SYMBOL               : chr  "KIAA1671" "TIMM10" "IFNG" "BIK" ...

We have the log2 Fold Change between the two conditions (here log2(Pat) - log2(Cont)), with its confidence interval, followed by the average expression in both conditions, then the statistic value t, the P.Value and adj.P.Val after Benjamini-Hochberg correction and the B statiscics (the one used for pvalue). Then you have probe annotations.

2. Enhanced heatmaps with ComplexHeatmap¶


There are two main packages to draw enhanced heatmaps:

  • pheatmap: a CRAN packages.
  • ComplexHeatmap on GitHub but also available in Bioconductor, the second generation of pheatmap.

ComplexHeatmap is the most flexible tool to draw heatmaps with a lot of options, notably to add annotations.

Below is presented a quick example on ComplexHeatmap usage.

  • Let's first take an output from limma DGE analysis, for example with the full model and the 34 differentially expressed genes. We will save it in a matrix format beacause ComplexHeatmap works with matrices.
In [10]:
# Code cell n°10

probes_sig <- subset(limma.full.outs$limma.fullmodel.out, adj.P.Val < 0.01 )[, "ProbeID"]
length(probes_sig)
mat_full <- norm.quant[row.names(norm.quant) %in% probes_sig,]
str(mat_full)
colnames(mat_full) <- samples_info$Status
row.names(mat_full) <- probes$TargetID[probes$ProbeID %in% probes_sig]
34
 num [1:34, 1:264] 7.84 7.41 7.54 10.67 9.16 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:34] "1820347" "6280440" "2510546" "6650079" ...
  ..$ : chr [1:264] "5753669129_B" "5753669129_C" "5753669129_E" "5753669129_K" ...
  • In limma, the coolmap() function could draw a simple heatmap :
In [11]:
# Code cell n°11
options(repr.plot.width = 10, repr.plot.height = 10)
limma::coolmap(mat_full, col = "redblue",cexCol = 0.6)
  • To use ComplexHeatmap, the dataset must be a matrix with samples in columns and genes in rows. We thus transpose our matrix mat_full.

  • In addition, we use the function scale() to center our data with a Z score:

In [12]:
# Code cell n°12

tmat_full <- t(apply(mat_full, 1, scale))
  • Then you select a distance for the similarity between samples. It can be a pre-defined character which is in ("euclidean", "maximum", "manhattan", "canberra", "binary", "minkowski", "pearson", "spearman", "kendall"). Default is "euclidean". It can also be a function. In R the function to compute distances is dist(). The correlation distance is defined as 1 - cor(x, y, method). See there for further details.
In [13]:
# Code cell n°13

ComplexHeatmap::Heatmap(tmat_full,
                        name = "Z-score",
                        clustering_distance_rows = "pearson",
                        column_title = "pre-defined distance method (1 - pearson)")
  • You can also select the clustering method by clustering_method_rows and clustering_method_columns. Possible methods are those supported in hclust() function: "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC).
In [14]:
# Code cell n°14

ComplexHeatmap::Heatmap(tmat_full,
                        name = "Z-score",
                        clustering_distance_rows = "pearson",
                        column_title = "pre-defined distance method (1 - pearson)",
                        clustering_method_rows = "ward.D")
  • One nice usage of ComplexHeatmap is the possibility to add custom annotations: ha for "heatmap annotation" is the object name used in ComplexHeatmap tutorial.

    • You can add annotations on samples, either quantitative with the fundtion anno_points() or anno_barplot() or qualitative with a dataframe. The col argument is used to specify the colors of the different categorical values. For the quantitative values, you can add points or boxplots for example.
In [15]:
# Code cell n°15

ha <- HeatmapAnnotation(age = anno_points(samples_info$Age),
                        df = data.frame(Sex = samples_info$Sex,
                                        Stim = samples_info$Stim,
                                        Status = samples_info$Status),
                        col = list(Status = c("1" = "blue", "2" = "red"),
                                   Stim = c("0" = "lightgreen", "6"= "green", "24" = "darkgreen"),
                                   Sex = c("1" = "lightblue", "2" = "pink"))
                        )
ComplexHeatmap::Heatmap(tmat_full,
                        name = "Z-score",
                        top_annotation = ha,
                        row_names_gp = gpar(fontsize = 10)
                       )

Below is another example without clustering the samples, saving the plot in an R object then drawing the heatmap with the draw() function.

In [16]:
# Code cell n°16

h_tx <- ComplexHeatmap::Heatmap(tmat_full,
                                name = "Z-score",
                                top_annotation = ha,
                                row_names_gp = gpar(fontsize = 10),
                                cluster_columns = FALSE)

ComplexHeatmap::draw(h_tx)

You could also add annotations to genes: adding gene level expression for example (see next example), or coloring genes given gene sets.

In [17]:
# Code cell n°17

ha_col <- columnAnnotation(age = anno_points(samples_info$Age),
                        df = data.frame(Sex = samples_info$Sex,
                                        Stim = samples_info$Stim,
                                        Status = samples_info$Status),
                        col = list(Status = c("1" = "blue", "2" = "red"),
                                   Stim = c("0" = "lightgreen", "6"= "green", "24" = "darkgreen"),
                                   Sex = c("1" = "lightblue", "2" = "pink"))
                        )

ha_row <- rowAnnotation(mean_expr = anno_points(subset(limma.full.outs$limma.fullmodel.out, adj.P.Val < 0.01 )$AveExpr)
)
In [18]:
# Code cell n°18
h_tmat_full <- ComplexHeatmap::Heatmap(tmat_full,
                                name = "Z-score",
                                top_annotation = ha_col,
                                right_annotation = ha_row,
                                row_names_gp = gpar(fontsize = 10),
                                cluster_columns = FALSE)

ComplexHeatmap::draw(h_tmat_full)

=> There are many more possibilities with ComplexHeatmap, for example to play with color gradient, to display dendograms with different colors, to split the heatmap with the clusters, to draw heatmaps next to each other in a composite plot... Just refer to its guide!

Note: You could also try to play with two other packages to draw interactive heatmaps:

  • InteractiveComplexHeatmap
  • heatmaply as explained in this short tutorial.
    They were not installed on adenine.
Success: Well done! You now know how to draw annotated heatmaps.

3. Enrichment analysis with clusterProfiler¶


There are two main packages to study gene sets enrichment in R.

  • gprofiler2: a CRAN packages to do similar analyses as in the web interface of G:Profiler

  • clusterProfiler available in Bioconductor.

clusterProfiler is by far the most complete and flexible tool to perform ORAs (Over-Representation Analyses) that are single-gene enrichment methods, or to perform SAFE (Significance Analysis of Function and Expression) like GSEA.

Its guide is available at both these links:

  • https://bioconductor.org/packages/release/bioc/vignettes/clusterProfiler/inst/doc/clusterProfiler.html
  • https://yulab-smu.top/biomedical-knowledge-mining-book/index.html

3.1. ORA¶

clusterProfiler can be used for several ORAs on different datasets: Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), Reactome, Disease Ontology (DO), Disease Gene Network (DisGeNET), wikiPathways, Molecular Signatures Database (MSigDb) and custom datasets.

We will illustrate clusterProfiler usage on GO, Reactome and KEGG.

3.1.A. GO¶

clusterProfiler needs a set of unique genes.

In [19]:
# Code cell n°19

mygenes <- unique(row.names(mat_full))

The function to run it for GO ebrichment is enrichGO(). For the ont (Ontology) argument, we can specify either molecular function ("MF"), cellular components ("CC"), biological processes ("BP") or "ALL".

Here we use org.Hs.eg.db as the database collecting all genes for the human genome with their different identifiers. This database is available in a bioconductor package: https://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html

Caution: When performing enrichment analyses, you should not run the analysis againts all genes in the genome, but only on the subset of expressed genes in your samples. We call it the universe. Here we select all unique expressed genes.
For the `keyType` in GO enrichment, we can use the genes SYMBOL rather than the ENSEMBL or ENTREZID. The `minGSize` and `maxGsize` are parameters to set up the min and max of number of genes to be part of a gene set to be tested. Finally, we select statistical parameters: a pvalueCutoff after an adjustment for multiple testing.
In [20]:
# Code cell n°20

enr_go <- enrichGO(gene = mygenes, 
             ont ="ALL", 
             OrgDb = org.Hs.eg.db,
             universe = unique(probes$TargetID),
             keyType = "SYMBOL", 
             minGSSize = 3, 
             maxGSSize = 800, 
             pvalueCutoff = 0.1, 
             pAdjustMethod = "BH")

Be patient, the above command can last a minute or so

We can visualize for example to top 20 enrichments:

In [21]:
# Code cell n°21
head(enr_go, 20)
A data.frame: 20 × 10
ONTOLOGYIDDescriptionGeneRatioBgRatiopvaluep.adjustqvaluegeneIDCount
<chr><chr><chr><chr><chr><dbl><dbl><dbl><chr><int>
GO:0060333BPGO:0060333interferon-gamma-mediated signaling pathway 3/2327/15612 7.983325e-060.0080631580.006168169IFNG/IRF1/STAT1 3
GO:0034341BPGO:0034341response to interferon-gamma 4/23138/156124.541988e-050.0229370400.017546417IFNG/IRF1/STAT1/UBD 4
GO:0051770BPGO:0051770positive regulation of nitric-oxide synthase biosynthetic process2/2316/15612 2.460307e-040.0541143410.041396484IFNG/STAT1 2
GO:0051767BPGO:0051767nitric-oxide synthase biosynthetic process 2/2320/15612 3.881544e-040.0541143410.041396484IFNG/STAT1 2
GO:0051769BPGO:0051769regulation of nitric-oxide synthase biosynthetic process 2/2320/15612 3.881544e-040.0541143410.041396484IFNG/STAT1 2
GO:0051607BPGO:0051607defense response to virus 4/23242/156123.954710e-040.0541143410.041396484IFNG/IRF1/RNASE6/STAT1 4
GO:0140546BPGO:0140546defense response to symbiont 4/23242/156123.954710e-040.0541143410.041396484IFNG/IRF1/RNASE6/STAT1 4
GO:0035458BPGO:0035458cellular response to interferon-beta 2/2321/15612 4.286284e-040.0541143410.041396484IRF1/STAT1 2
GO:0071346BPGO:0071346cellular response to interferon-gamma 3/23116/156126.350974e-040.0569755130.043585228IFNG/IRF1/STAT1 3
GO:0030097BPGO:0030097hemopoiesis 6/23762/156126.547515e-040.0569755130.043585228CD79B/IFNG/IRF1/SH3PXD2A/STAT1/UBD6
GO:0045589BPGO:0045589regulation of regulatory T cell differentiation 2/2326/15612 6.603882e-040.0569755130.043585228IFNG/IRF1 2
GO:0035456BPGO:0035456response to interferon-beta 2/2328/15612 7.667074e-040.0569755130.043585228IRF1/STAT1 2
GO:0002521BPGO:0002521leukocyte differentiation 5/23518/156128.082284e-040.0569755130.043585228CD79B/IFNG/IRF1/SH3PXD2A/UBD 5
GO:0045066BPGO:0045066regulatory T cell differentiation 2/2329/15612 8.227632e-040.0569755130.043585228IFNG/IRF1 2
GO:0048534BPGO:0048534hematopoietic or lymphoid organ development 6/23800/156128.461710e-040.0569755130.043585228CD79B/IFNG/IRF1/SH3PXD2A/STAT1/UBD6
GO:0009615BPGO:0009615response to virus 4/23338/156121.379932e-030.0828820120.063403228IFNG/IRF1/RNASE6/STAT1 4
GO:0030099BPGO:0030099myeloid cell differentiation 4/23339/156121.395044e-030.0828820120.063403228IFNG/SH3PXD2A/STAT1/UBD 4
GO:0032735BPGO:0032735positive regulation of interleukin-12 production 2/2340/15612 1.565186e-030.0878243080.067183994IFNG/IRF1 2
GO:0006775BPGO:0006775fat-soluble vitamin metabolic process 2/2343/15612 1.807145e-030.0960640130.073487218CYP4F3/IFNG 2
GO:0019815CCGO:0019815B cell receptor complex 1/234/16055 5.718533e-030.0998689190.084100143CD79B 1

Several nice plots can be drawn as described in Chapter 15 of the manual: http://yulab-smu.top/biomedical-knowledge-mining-book/enrichplot.html.

Among available plots, we can draw:

  • barplots is the most widely used vizualization. It depict the enrichment scores (e.g. p values) and gene counts or ratio as bars whose length and color are proportional to the number of enriched genes.
In [22]:
# Code cell n°22
options(repr.plot.width = 12, repr.plot.height = 7)
barplot(enr_go, showCategory = 20,
        label_format = function(x) stringr::str_wrap(x, width = 120)) + #to be able to see terms description in a single row : play on the number (eg. 120)
        ggtitle("dotplot for ORA") ## uses ggplot2 you will see in session 4
  • dotplots that depict the enrichment scores (e.g. p values) and gene counts or ratio as dots whose size is proportional to the number of enriched genes. By default they are sorted by incerasing p adjust pvalue.
In [23]:
# Code cell n°23
options(repr.plot.width = 12, repr.plot.height = 7)
dotplot(enr_go,
        showCategory=20,
        label_format = function(x) stringr::str_wrap(x, width=  120)) +
        ggtitle("dotplot for ORA")
  • emapplots (enrichment map plots): Enrichment map organizes enriched terms into a network with edges connecting overlapping gene sets. In this way, mutually overlapping gene sets tend to cluster together, making it easy to identify functional module.
In [24]:
# Code cell n°24
options(repr.plot.width = 30, repr.plot.height= 10)
enr_go <- enrichplot::pairwise_termsim(enr_go, method = "JC", semData = "org.Hs.eg.db")
emapplot(enr_go, showCategory = 20)

Do not hesitate to rerun the above command if the figure displayed is not readable. The layout changes every time it is run.

  • treeplots: displays a hierarchical clustering of enriched terms (after pairwise step in previous cell)
In [25]:
# Code cell n°25
options(repr.plot.width = 30, repr.plot.height = 15)
treeplot(enr_go,
         hclust_method = "average")
  • upsetplots: displays the association between genes and gene sets. It emphasizes the gene overlapping among different gene sets.
In [26]:
# Code cell n°26
upsetplot(enr_go)
  • cne plots: depict the linkages of genes and biological concepts (e.g. GO terms or KEGG pathways) as a network.

One plot often generated like the upsetplot is the cnetplot that depicts the linkages of genes and biological concepts (e.g. GO terms or KEGG pathways) as a network.

For some reason, it fails in this version of clusterProfiler installed on the environment and we cannot update it because the last version uses a newer R version. But below is the command to use that should work on your computer and a picture of what you should obtain.

cnetplot(enr_go, categorySize="pvalue", showCategory = 5)

image.png

There are many more plots available. The package clusterProfiler is in constant development with new plots.

3.1.B. Reactome¶

Similarly, we can run the ORA on Reactome pathways. It uses the gene IDs rather than the gene symbols. So we have to get them.

In [27]:
# Code cell n°27

geneID_sig <- subset(limma.full.outs$limma.fullmodel.out, adj.P.Val < 0.01 )[, "ENTREZ_GENE_ID"]
length(geneID_sig)
34
In [28]:
# Code cell n°28

enr_react <- enrichPathway(gene = geneID_sig, pvalueCutoff = 0.10, readable = TRUE)
head(enr_react)
A data.frame: 2 × 9
IDDescriptionGeneRatioBgRatiopvaluep.adjustqvaluegeneIDCount
<chr><chr><chr><chr><dbl><dbl><dbl><chr><int>
R-HSA-877312R-HSA-877312Regulation of IFNG signaling2/2014/108670.00028899390.032367320.02159849IFNG/STAT1 2
R-HSA-877300R-HSA-877300Interferon gamma signaling 3/2091/108670.00058412000.032710720.02182764IFNG/IRF1/STAT13

You can draw any figures as above.

In [29]:
# Code cell n°29

options(repr.plot.width = 12, repr.plot.height = 7)
dotplot(enr_react,
        showCategory = 20,
        label_format = function(x) stringr::str_wrap(x, width=  120)) +
        ggtitle("dotplot for ORA")

3.1.C. KEGG¶

Similarly, we can run the ORA on KEGG pathways. The trick here is not to use gene name symbols, nor Gene IDs but their matching Uniprot IDs. The function bitr() generates the correspondance when available.

In [30]:
# Code cell n°30

mygenes2 <- clusterProfiler::bitr(mygenes, fromType = 'SYMBOL', toType = 'UNIPROT', OrgDb = 'org.Hs.eg.db')
mygenes2
'select()' returned 1:many mapping between keys and columns

Warning message in clusterProfiler::bitr(mygenes, fromType = "SYMBOL", toType = "UNIPROT", :
“29.41% of input gene IDs are fail to map...”
A data.frame: 46 × 2
SYMBOLUNIPROT
<chr><chr>
1ASB1 Q9Y576
2ASB1 B9A047
3BIK A0A024R4X6
4BIK Q13323
7CASP6 P55212
8CD79B P40259
9CYP4F3 A0A024R7J8
10CYP4F3 Q08477
11CYP4F3 A0A024R7I2
12DPP7 Q9UHL4
13EXOC4 Q6NX51
14EXOC4 Q96A65
17FBXL6 Q8N531
18GLYATL2 A0A024R4Z5
19GLYATL2 Q8WU03
21IFNG P01579
22IRF1 P10914
23IRF1 Q6FHN8
24KIAA1671Q9BY89
29MSH3 P20585
30NME6 O75414
31NME6 C9JQB1
32NME6 A0A0C4DG91
33PCBP4 A0A024R320
34PCBP4 P57723
35PCBP4 C9J0A4
36RNASE6 Q6IB39
37RNASE6 Q93091
38SH3PXD2AB3KPL1
39SH3PXD2AQ5TCZ1
40SPACA3 Q05C28
41SPACA3 E9PF91
42SPACA3 Q8IXA5
43SPACA3 A0A080YUZ7
44ST3GAL6 A0A087WXB8
45ST3GAL6 Q9Y274
46STARD13 Q9Y3M8
47STARD13 B3KRK6
48STARD13 B2R789
49STARD13 A0A024RDV4
50STAT1 P42224
51TIMM10 P62072
52TSPAN12 A0A024R740
53TSPAN12 O95859
54UBD A0A1U9X8S6
55UBD O15205

As you can see, several UniprotIDs may exist for a given gene.

In [31]:
# Code cell n°31
table(mygenes2$SYMBOL)
    ASB1      BIK    CASP6    CD79B   CYP4F3     DPP7    EXOC4    FBXL6 
       2        2        1        1        3        1        2        1 
 GLYATL2     IFNG     IRF1 KIAA1671     MSH3     NME6    PCBP4   RNASE6 
       2        1        2        1        1        3        3        2 
SH3PXD2A   SPACA3  ST3GAL6  STARD13    STAT1   TIMM10  TSPAN12      UBD 
       2        4        2        4        1        1        2        2 

Thus, duplicated IDs must be removed if any UNIPROT IDs taken.

In [32]:
# Code cell n°32

mygenes2 <- mygenes2[!duplicated(mygenes2$SYMBOL), ]
mygenes2
mygeneList2 <-mygenes2$UNIPROT
mygeneList2
A data.frame: 24 × 2
SYMBOLUNIPROT
<chr><chr>
1ASB1 Q9Y576
3BIK A0A024R4X6
7CASP6 P55212
8CD79B P40259
9CYP4F3 A0A024R7J8
12DPP7 Q9UHL4
13EXOC4 Q6NX51
17FBXL6 Q8N531
18GLYATL2 A0A024R4Z5
21IFNG P01579
22IRF1 P10914
24KIAA1671Q9BY89
29MSH3 P20585
30NME6 O75414
33PCBP4 A0A024R320
36RNASE6 Q6IB39
38SH3PXD2AB3KPL1
40SPACA3 Q05C28
44ST3GAL6 A0A087WXB8
46STARD13 Q9Y3M8
50STAT1 P42224
51TIMM10 P62072
52TSPAN12 A0A024R740
54UBD A0A1U9X8S6
  1. 'Q9Y576'
  2. 'A0A024R4X6'
  3. 'P55212'
  4. 'P40259'
  5. 'A0A024R7J8'
  6. 'Q9UHL4'
  7. 'Q6NX51'
  8. 'Q8N531'
  9. 'A0A024R4Z5'
  10. 'P01579'
  11. 'P10914'
  12. 'Q9BY89'
  13. 'P20585'
  14. 'O75414'
  15. 'A0A024R320'
  16. 'Q6IB39'
  17. 'B3KPL1'
  18. 'Q05C28'
  19. 'A0A087WXB8'
  20. 'Q9Y3M8'
  21. 'P42224'
  22. 'P62072'
  23. 'A0A024R740'
  24. 'A0A1U9X8S6'

You do the same gene SYMBOL to UNIPROT conversion for the universe:

In [33]:
# Code cell n°33

prot_universe <- unique(probes$TargetID)
prot_universe <- clusterProfiler::bitr(prot_universe, fromType = 'SYMBOL', toType = 'UNIPROT', OrgDb = 'org.Hs.eg.db')
prot_universe <- prot_universe[!duplicated(prot_universe$SYMBOL), ]
prot_universe <- prot_universe$UNIPROT
'select()' returned 1:many mapping between keys and columns

Warning message in clusterProfiler::bitr(prot_universe, fromType = "SYMBOL", toType = "UNIPROT", :
“53.77% of input gene IDs are fail to map...”

You are now ready for the enrichment analysis on KEGG.

Unfortunately, the installed version of clusterProfiler in adenine is no longer able to run it and we cannot update it to a newer version which is based on R 4.2.0. But we put the commands below and an image of the expected results. On you personnal computer, it whould work.

enr_kegg <- enrichKEGG(gene = mygeneList2, keyType = "uniprot", minGSSize = 3, maxGSSize = 800, pvalueCutoff = 0.10, universe = prot_universe, organism = "hsa", pAdjustMethod = "none")

dotplot(enr_kegg)

image.png

enr_kegg <- pairwise_termsim(enr_kegg, method = "JC", semData = "org.Hs.eg.db") emapplot(enr_kegg)

image.png

enr_kegg2 <- setReadable(enr_kegg, OrgDb = "org.Hs.eg.db", "UNIPROT") cnetplot(enr_kegg2, categorySize="pvalue", showCategory = 5)

image.png

3.2. GSEA¶

GSEA, standing for Gene Set Enrichment Analysis, is a Significance Analysis of Function and Expression (SAFE) method. While ORAs (over-representation methods) are single-gene approaches, GSEA uses all genes in the dataset and their potential correlation. In addition, it uses quantitative metrics like gene expression or statistics results from a DGE analysis. This later is the most recommended approach to follow.

GSEA was developped at the Broad Institue. The documentation is available here. It was described in Subramanian et al. 2005

Finally we can run a Gene Set Enrichment Aanalysis (GSEA) directly with clusterProfiler. We need first to sort genes by a quantitative metrics, here the statistics of the DGE analysis.

We do it here for example with the Pat versus Control DGE results on GO terms.

In [34]:
# Code cell n°34

mygenes <- limma.full.outs[[6]]$B
names(mygenes) <- limma.full.outs[[6]]$TargetID
mygenes <- sort(mygenes, decreasing = T) 
head(mygenes)
KIAA1671
19.9679715869778
TIMM10
14.8937605533525
IFNG
11.5032650302589
BIK
10.567386404523
C10ORF125
10.0582311331783
PCBP4
9.55705479628613
In [35]:
# Code cell n°35

gsea_go <- gseGO(geneList = mygenes, 
             ont ="ALL", 
             keyType = "SYMBOL",
             minGSSize = 3, 
             maxGSSize = 800, 
             pvalueCutoff = 0.25, 
             verbose = TRUE, 
             OrgDb = org.Hs.eg.db, 
             pAdjustMethod = "none",
                 nPerm = 100)    # by default there are 1000 permutations
preparing geneSet collections...

GSEA analysis...

Warning message in .GSEA(geneList = geneList, exponent = exponent, minGSSize = minGSSize, :
“We do not recommend using nPerm parameter incurrent and future releases”
Warning message in fgsea(pathways = geneSets, stats = geneList, nperm = nPerm, minSize = minGSSize, :
“You are trying to run fgseaSimple. It is recommended to use fgseaMultilevel. To run fgseaMultilevel, you need to remove the nperm argument in the fgsea function call.”
Warning message in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, :
“There are duplicate gene names, fgsea may produce unexpected results.”
Warning message in fgseaSimple(...):
“There were 196 pathways for which P-values were not calculated properly due to unbalanced gene-level statistic values”
leading edge analysis...

done...

It may take a long time and use a lot of memory due to permutations.

Let's see the result of the top enriched pathway.

In [36]:
# Code cell n°36

gseaplot2(gsea_go, geneSetID = 1, title = gsea_go$Description[1])

We can also display several patways on the same plot.

In [37]:
# Code cell n°37

gseaplot2(gsea_go, geneSetID = 1:3, title = "top 3 gene sets", pvalue_table = TRUE,
          color = c("#E495A5", "#86B875", "#7DB0DD"), ES_geom = "dot")

We can also look at the usual figures implemented in clusterProfiler, such as dotplots :

In [38]:
# Code cell n°38

options(repr.plot.width = 17, repr.plot.height = 10)
dotplot(gsea_go, showCategory=20,
        label_format = function(x) stringr::str_wrap(x, width=  120))+
        ggtitle("dotplot for GSEA")

or emmaplots:

In [39]:
# Code cell n°39

options(repr.plot.width = 12, repr.plot.height = 7)

gsea_go <- enrichplot::pairwise_termsim(gsea_go, method = "JC", semData = "org.Hs.eg.db")
emapplot(gsea_go, showCategory = 20)

or the cneplots :

Here again, we provide you the commands and results but it does not run on this clusterProfiler version.

options(repr.plot.width=8, repr.plot.height= 8) cnetplot(gsea_go, categorySize="pvalue", showCategory = 5)

image.png

The size of the plot can impact the figure: see the same figure with higher height!

options(repr.plot.width = 13, repr.plot.height = 13) cnetplot(gsea_go, categorySize="pvalue", showCategory = 5)

image.png

We can also use the ridgeplot format which visualizes expression distributions of core enriched genes for GSEA enriched categories. It helps users to interpret up/down-regulated pathways.

In [40]:
# Code cell n°40

options(repr.plot.width = 30, repr.plot.height = 7)
ridgeplot(gsea_go, label_format = function(x) stringr::str_wrap(x, width=  120))
Picking joint bandwidth of 0.225

The x axis is the log2(FC) distribution of genes in pathways. Here all enriched pathways are downregulated!

4. GSEA with the Java application:¶

4.1. Preparing input files¶

The GSEA Java application is a stand-alone software to run GSEA on your own computer.

Two GSEA methods are implemented in the Java application:

  1. with unranked expression data: GSEA will cluster genes by similarity (different metrics/methods can be used).
  2. with pre-ranked expression data: this method is the recommended one. After DGE analysis, we usually upload the data ranked by the statistics, or (and this is equivalent) ranked by the signed pvalue (meaning a positive pvalue when the fold change is positive, a negative pvalue when the fold change is negative).

Whatever the method used, GSEA can only compare two conditions.

GSEA requires two input files:

  • an expression file
  • a phenotype file

The format for the different inputs is described here

4.1.A. Expression files:¶

There are 4 acceptable formats for expression input files: .gct, .res, .pcl and .txt. Here we will prepare files in .txt format with R. It requires to fill a first column NAMES with the names of the genes/probes, followed by a column DESCRIPTION that must be with NA values (a bug in GSEA requires this column although it is not informative; they will try to fix this issue in a future release), then a column for each sample.

  • if we start from normalised unranked values:

We use the norm.quant object.

In [41]:
# Code cell n°41

gsea_T1D_norm.quant <- data.frame("NAMES" = row.names(norm.quant), "DESCRIPTION" = NA, as.data.frame(norm.quant))
names(gsea_T1D_norm.quant)[-(1:2)] <- samples_info$array.labels
str(gsea_T1D_norm.quant)
write.table(gsea_T1D_norm.quant, file="gsea_T1D_norm.quant.txt", sep="\t", col.names = TRUE, quote = FALSE, row.names = FALSE)
'data.frame':	47323 obs. of  266 variables:
 $ NAMES       : chr  "6450255" "2570615" "6370619" "2600039" ...
 $ DESCRIPTION : logi  NA NA NA NA NA NA ...
 $ 5753669129_B: num  7.35 7.09 6.3 6.57 6.49 ...
 $ 5753669129_C: num  7.06 6.91 6.75 6.61 6.49 ...
 $ 5753669129_E: num  7.05 6.87 6.55 6.57 6.77 ...
 $ 5753669129_K: num  7.4 6.79 6.57 6.39 6.51 ...
 $ 5753669129_A: num  7.22 6.64 6.63 6.44 6.94 ...
 $ 5753669129_F: num  6.87 6.77 6.47 6.78 6.43 ...
 $ 5753669095_B: num  7.02 6.65 6.4 6.47 6.61 ...
 $ 5753669095_F: num  6.99 6.84 6.32 6.54 6.62 ...
 $ 5753669095_G: num  7.29 6.82 6.46 6.49 6.43 ...
 $ 5753669095_K: num  7.08 6.68 6.43 6.62 6.45 ...
 $ 5753669095_D: num  7.2 7 6.74 6.52 6.4 ...
 $ 5753669095_E: num  7.34 6.9 6.62 6.73 6.83 ...
 $ 5753669095_A: num  6.9 6.86 6.59 6.62 6.53 ...
 $ 5753669095_H: num  7.29 6.56 6.44 6.6 6.38 ...
 $ 5753669095_C: num  6.89 6.81 6.55 6.73 6.69 ...
 $ 5753669095_J: num  7 6.95 6.58 6.49 6.76 ...
 $ 5753669095_I: num  7.02 6.81 6.54 6.55 6.51 ...
 $ 5753669095_L: num  7.01 6.87 6.63 6.67 6.38 ...
 $ 5753669094_B: num  6.87 6.8 6.43 6.63 6.44 ...
 $ 5753669094_D: num  6.86 6.68 6.63 6.69 6.59 ...
 $ 5753669094_E: num  7.26 6.46 6.69 6.69 6.5 ...
 $ 5753669094_I: num  7.57 6.81 6.4 6.64 6.54 ...
 $ 5753669094_A: num  6.88 6.88 6.44 6.74 6.35 ...
 $ 5753669094_F: num  6.85 6.63 6.44 6.71 6.69 ...
 $ 5753669102_D: num  6.7 6.89 6.61 6.68 6.38 ...
 $ 5753669102_B: num  7.12 6.63 6.63 6.22 6.58 ...
 $ 5753669102_K: num  7.01 6.81 6.51 6.34 6.76 ...
 $ 5753669102_A: num  6.62 6.94 6.66 6.56 6.41 ...
 $ 5753669102_J: num  7.15 6.99 6.34 6.67 6.59 ...
 $ 5753669094_J: num  6.93 6.7 6.53 6.52 6.76 ...
 $ 5753669094_K: num  6.88 6.57 6.52 6.69 6.35 ...
 $ 5753669094_G: num  6.89 6.96 6.42 6.61 6.42 ...
 $ 5753669094_L: num  7.04 6.67 6.85 6.62 6.7 ...
 $ 5753669094_C: num  6.85 6.85 6.68 6.58 6.47 ...
 $ 5753669094_H: num  6.68 6.59 6.47 6.5 6.53 ...
 $ 5753669028_G: num  7.13 6.79 6.55 6.57 6.43 ...
 $ 5753669028_H: num  7.07 6.9 6.75 6.35 6.43 ...
 $ 5753669028_B: num  7.46 7.14 6.82 6.37 6.55 ...
 $ 5753669028_J: num  7.54 6.64 6.62 6.48 6.53 ...
 $ 5753669028_C: num  7.62 6.83 6.69 6.55 6.78 ...
 $ 5753669028_D: num  7.31 6.92 6.6 6.34 6.54 ...
 $ 5753669028_A: num  6.93 6.91 6.54 6.47 6.51 ...
 $ 5753669028_K: num  7.22 6.72 6.5 6.49 6.72 ...
 $ 5753669028_E: num  7.23 6.75 6.56 6.33 6.64 ...
 $ 5753669028_I: num  7.5 6.67 6.67 6.46 6.17 ...
 $ 5753669028_F: num  7.1 6.65 6.63 6.47 6.57 ...
 $ 5753669028_L: num  7.19 6.74 6.68 6.47 6.6 ...
 $ 5753669136_D: num  7.17 6.74 6.38 6.54 6.16 ...
 $ 5753669136_F: num  7.11 6.87 6.53 6.52 6.52 ...
 $ 5753669136_I: num  7.27 6.83 6.6 6.72 6.47 ...
 $ 5753669136_J: num  7.57 6.96 7.16 6.57 6.39 ...
 $ 5753669136_B: num  7.19 6.97 6.53 6.63 6.62 ...
 $ 5753669136_K: num  6.97 6.78 6.52 6.41 6.31 ...
 $ 5753669111_I: num  7.01 7.03 6.62 6.51 6.44 ...
 $ 5753669111_F: num  7.52 6.72 6.47 6.55 6.33 ...
 $ 5753669111_L: num  7.25 6.97 6.48 6.24 6.18 ...
 $ 5753669111_J: num  7.13 6.61 6.67 6.82 6.46 ...
 $ 5753669111_K: num  6.72 7.11 6.48 6.49 6.69 ...
 $ 5753669129_G: num  7.78 6.79 6.44 6.55 6.57 ...
 $ 5753669129_H: num  7.41 6.7 6.55 6.85 6.46 ...
 $ 5753669129_D: num  7.27 7.16 6.91 6.57 6.67 ...
 $ 5753669129_J: num  7.53 6.94 6.57 6.7 6.36 ...
 $ 5753669129_I: num  7.67 6.59 6.68 6.61 6.77 ...
 $ 5753669129_L: num  7.23 6.58 6.51 6.45 6.68 ...
 $ 5753669096_B: num  7.42 6.71 6.47 6.37 6.47 ...
 $ 5753669096_C: num  7.23 6.84 6.43 6.76 6.41 ...
 $ 5753669096_H: num  7.33 6.29 6.68 6.57 6.69 ...
 $ 5753669096_I: num  7.44 6.77 6.76 6.75 6.26 ...
 $ 5753669096_A: num  7.26 6.61 6.67 6.54 6.29 ...
 $ 5753669096_J: num  7.23 6.71 6.5 6.57 6.67 ...
 $ 5753669138_A: num  7.37 6.8 6.68 6.64 6.54 ...
 $ 5753669138_H: num  7.31 6.89 6.77 6.64 6.48 ...
 $ 5753669138_J: num  7.41 6.86 6.71 6.43 6.33 ...
 $ 5753669138_L: num  7.63 6.98 6.48 6.68 6.35 ...
 $ 5753669138_C: num  7.13 6.85 6.54 6.75 6.42 ...
 $ 5753669096_E: num  6.87 6.89 6.43 6.62 6.36 ...
 $ 5753669096_L: num  7.17 6.61 6.59 6.47 6.59 ...
 $ 5753669096_F: num  6.94 6.75 6.56 6.61 6.3 ...
 $ 5753669096_K: num  7.19 6.61 6.46 6.43 6.59 ...
 $ 5753669096_D: num  7.34 6.63 6.4 6.63 6.64 ...
 $ 5753669096_G: num  7.14 6.55 6.64 6.48 6.43 ...
 $ 5753669109_I: num  7.19 6.76 6.36 6.47 6.6 ...
 $ 5753669109_K: num  7.24 6.91 6.64 6.5 6.46 ...
 $ 5753669109_A: num  7.17 6.67 6.62 6.53 6.16 ...
 $ 5753669109_F: num  6.96 6.8 6.72 6.49 6.49 ...
 $ 5753669109_B: num  7.23 6.72 6.57 6.45 6.66 ...
 $ 5753669109_L: num  7.34 6.7 6.33 6.35 6.73 ...
 $ 5753669138_D: num  7.16 6.82 6.53 6.44 6.46 ...
 $ 5753669138_B: num  7.34 6.75 6.61 6.57 6.54 ...
 $ 5753669138_G: num  7.57 7.09 6.56 6.51 6.53 ...
 $ 5753669138_F: num  7.34 7 6.65 6.51 6.49 ...
 $ 5753669138_I: num  7.39 6.8 6.41 6.77 6.48 ...
 $ 5753669136_A: num  7.14 6.71 6.63 6.68 6.47 ...
 $ 5753669136_E: num  7 6.72 6.55 6.57 6.63 ...
 $ 5753669136_G: num  7.14 6.77 6.67 6.58 6.33 ...
 $ 5753669136_H: num  7.53 6.62 6.62 6.83 6.49 ...
 $ 5753669136_C: num  7.1 6.88 6.61 6.5 6.42 ...
  [list output truncated]
  • if we start from preranked values: using the signed pvalues

The format is RNK: Ranked list file format (*.rnk) Here are the command for the 5th contrast as an example. You can do the same for other constrats.

In [42]:
# Code cell n°42

gsea_T1D_PatVsCont <- data.frame(limma.full.outs[[6]]$ProbeID, limma.full.outs[[6]]$B)
write.table(gsea_T1D_PatVsCont, file="gsea_T1D_PatVsCont.rnk", sep="\t", col.names = FALSE, quote = FALSE, row.names = FALSE)

4.1.B. Phenotype files:¶

The format .cls format is the one for phenotypes.

  1. On the first line, you specify 3 values separated by spaces:
  • the number of samples
  • the number of classes (or levels of your factor of interest)
  • 1: always, don't change it
  1. On the second line, you put names for each class as they will appear in analysis report. The line should begin with a hashtag sign (#) followed by a space.

  2. The third line contains a class label for each sample. The class label can be the class name, a number, or a text string. The first label used is assigned to the first class named on the second line; the second unique label is assigned to the second class named; and so on. The number of class labels specified on this line should be the same as the number of samples specified in the first line. The number of unique class labels specified on this line should be the same as the number of classes specified in the first line.

Here, we will prepare two phenotype files.

  • You can prepare one phenotype file per contrast of interest as for example between patients and controls (whathever the stimulus):
In [43]:
# Code cell n°43

temp <- samples_info$Status
temp <- ifelse(temp == 2, "PAT", "CONT")
gsea_pheno_PatVsCont <- file("gsea_pheno_PatVsCont.cls")
my_text <- paste("264 2 1\n#PAT CONT\n", paste(temp, collapse=" "))
writeLines(my_text, gsea_pheno_PatVsCont)
close(gsea_pheno_PatVsCont)
rm(temp)

You should see the gsea_pheno_PatVsCont file in the left column. Open it to see its structure.

  • You can also prepare a phenotype file with all groups depending on status and stimulus: g.1.0, g.1.6, g.1.24, g.2.0, g.2.6, g.2.24:
In [44]:
# Code cell n°44

str(samples_info)
'data.frame':	264 obs. of  8 variables:
 $ array.labels: chr  "5753669129_B" "5753669129_C" "5753669129_E" "5753669129_K" ...
 $ PedID       : chr  "1" "1" "1" "1" ...
 $ ID          : chr  "1" "1" "1" "1" ...
 $ Status      : chr  "2" "2" "2" "2" ...
 $ Stim        : chr  "0" "0" "6" "6" ...
 $ Full        : chr  "1_1_2_0" "1_1_2_0" "1_1_2_6" "1_1_2_6" ...
 $ Sex         : chr  "1" "1" "1" "1" ...
 $ Age         : int  10 10 10 10 10 10 18 18 18 18 ...
In [45]:
# Code cell n°45

temp <- samples_info$Status
temp <- ifelse(temp == 2, "PAT", "CONT")
temp <- paste(temp, samples_info$Stim, sep = "")
temp
  1. 'PAT0'
  2. 'PAT0'
  3. 'PAT6'
  4. 'PAT6'
  5. 'PAT24'
  6. 'PAT24'
  7. 'CONT0'
  8. 'CONT0'
  9. 'CONT6'
  10. 'CONT6'
  11. 'CONT24'
  12. 'CONT24'
  13. 'CONT0'
  14. 'CONT0'
  15. 'CONT6'
  16. 'CONT6'
  17. 'CONT24'
  18. 'CONT24'
  19. 'PAT0'
  20. 'PAT0'
  21. 'PAT6'
  22. 'PAT6'
  23. 'PAT24'
  24. 'PAT24'
  25. 'CONT0'
  26. 'CONT6'
  27. 'CONT6'
  28. 'CONT24'
  29. 'CONT24'
  30. 'PAT0'
  31. 'PAT0'
  32. 'PAT6'
  33. 'PAT6'
  34. 'PAT24'
  35. 'PAT24'
  36. 'PAT0'
  37. 'PAT0'
  38. 'PAT6'
  39. 'PAT6'
  40. 'PAT24'
  41. 'PAT24'
  42. 'CONT0'
  43. 'CONT0'
  44. 'CONT6'
  45. 'CONT6'
  46. 'CONT24'
  47. 'CONT24'
  48. 'CONT0'
  49. 'CONT0'
  50. 'CONT6'
  51. 'CONT6'
  52. 'CONT24'
  53. 'CONT24'
  54. 'CONT0'
  55. 'CONT6'
  56. 'CONT6'
  57. 'CONT24'
  58. 'CONT24'
  59. 'PAT0'
  60. 'PAT0'
  61. 'PAT6'
  62. 'PAT6'
  63. 'PAT24'
  64. 'PAT24'
  65. 'CONT0'
  66. 'CONT0'
  67. 'CONT6'
  68. 'CONT6'
  69. 'CONT24'
  70. 'CONT24'
  71. 'CONT0'
  72. 'CONT0'
  73. 'CONT6'
  74. 'CONT6'
  75. 'CONT24'
  76. 'PAT0'
  77. 'PAT0'
  78. 'PAT6'
  79. 'PAT6'
  80. 'PAT24'
  81. 'PAT24'
  82. 'CONT0'
  83. 'CONT0'
  84. 'CONT6'
  85. 'CONT6'
  86. 'CONT24'
  87. 'CONT24'
  88. 'CONT0'
  89. 'CONT6'
  90. 'CONT6'
  91. 'CONT24'
  92. 'CONT24'
  93. 'CONT0'
  94. 'CONT0'
  95. 'CONT6'
  96. 'CONT6'
  97. 'CONT24'
  98. 'CONT24'
  99. 'PAT0'
  100. 'PAT0'
  101. 'PAT6'
  102. 'PAT6'
  103. 'PAT24'
  104. 'PAT24'
  105. 'CONT0'
  106. 'CONT0'
  107. 'CONT6'
  108. 'CONT6'
  109. 'CONT24'
  110. 'CONT24'
  111. 'CONT0'
  112. 'CONT6'
  113. 'CONT6'
  114. 'CONT24'
  115. 'CONT24'
  116. 'CONT0'
  117. 'CONT0'
  118. 'CONT6'
  119. 'CONT24'
  120. 'CONT24'
  121. 'PAT0'
  122. 'PAT0'
  123. 'PAT6'
  124. 'PAT6'
  125. 'PAT24'
  126. 'PAT24'
  127. 'CONT0'
  128. 'CONT0'
  129. 'CONT6'
  130. 'CONT6'
  131. 'CONT24'
  132. 'CONT24'
  133. 'CONT0'
  134. 'CONT0'
  135. 'CONT6'
  136. 'CONT6'
  137. 'CONT24'
  138. 'CONT24'
  139. 'PAT0'
  140. 'PAT0'
  141. 'PAT6'
  142. 'PAT6'
  143. 'PAT24'
  144. 'PAT24'
  145. 'PAT0'
  146. 'PAT0'
  147. 'PAT6'
  148. 'PAT6'
  149. 'PAT24'
  150. 'PAT24'
  151. 'CONT0'
  152. 'CONT0'
  153. 'CONT6'
  154. 'CONT6'
  155. 'CONT24'
  156. 'CONT24'
  157. 'CONT0'
  158. 'CONT0'
  159. 'CONT6'
  160. 'CONT6'
  161. 'CONT24'
  162. 'CONT24'
  163. 'PAT0'
  164. 'PAT0'
  165. 'PAT6'
  166. 'PAT6'
  167. 'PAT24'
  168. 'PAT24'
  169. 'CONT0'
  170. 'CONT0'
  171. 'CONT6'
  172. 'CONT6'
  173. 'CONT24'
  174. 'CONT24'
  175. 'PAT0'
  176. 'PAT0'
  177. 'PAT6'
  178. 'PAT6'
  179. 'PAT24'
  180. 'PAT24'
  181. 'PAT0'
  182. 'PAT0'
  183. 'PAT6'
  184. 'PAT6'
  185. 'PAT24'
  186. 'PAT24'
  187. 'CONT0'
  188. 'CONT0'
  189. 'CONT6'
  190. 'CONT6'
  191. 'CONT24'
  192. 'CONT24'
  193. 'PAT0'
  194. 'PAT0'
  195. 'PAT6'
  196. 'PAT6'
  197. 'PAT24'
  198. 'PAT24'
  199. 'CONT0'
  200. 'CONT0'
  201. 'CONT6'
  202. 'CONT6'
  203. 'CONT24'
  204. 'CONT24'
  205. 'PAT0'
  206. 'PAT0'
  207. 'PAT6'
  208. 'PAT6'
  209. 'PAT24'
  210. 'PAT24'
  211. 'CONT0'
  212. 'CONT0'
  213. 'CONT6'
  214. 'CONT6'
  215. 'CONT24'
  216. 'CONT24'
  217. 'PAT0'
  218. 'PAT0'
  219. 'PAT6'
  220. 'PAT6'
  221. 'PAT24'
  222. 'PAT24'
  223. 'CONT0'
  224. 'CONT0'
  225. 'CONT6'
  226. 'CONT6'
  227. 'CONT24'
  228. 'CONT24'
  229. 'PAT0'
  230. 'PAT0'
  231. 'PAT6'
  232. 'PAT6'
  233. 'PAT24'
  234. 'PAT24'
  235. 'PAT0'
  236. 'PAT0'
  237. 'PAT6'
  238. 'PAT6'
  239. 'PAT24'
  240. 'PAT24'
  241. 'CONT0'
  242. 'CONT0'
  243. 'CONT6'
  244. 'CONT6'
  245. 'CONT24'
  246. 'CONT24'
  247. 'PAT0'
  248. 'PAT0'
  249. 'PAT6'
  250. 'PAT6'
  251. 'PAT24'
  252. 'PAT24'
  253. 'PAT0'
  254. 'PAT0'
  255. 'PAT6'
  256. 'PAT6'
  257. 'PAT24'
  258. 'PAT24'
  259. 'CONT0'
  260. 'CONT0'
  261. 'CONT6'
  262. 'CONT6'
  263. 'CONT24'
  264. 'CONT24'
In [46]:
# Code cell n°46

gsea_pheno_groups <- file("gsea_pheno_groups.cls")
my_text <- paste("264 6 1\n#PAT0 PAT6 PAT24 CONT0 CONT6 CONT24\n", paste(temp, collapse=" "))
writeLines(my_text, gsea_pheno_PatVsCont)
close(gsea_pheno_groups)

4.2. Running GSEA (Java Stand-Alone Application)¶

You are ready to run your first GSEA analyses!

Download the Java application from here with the appropriate version for your OS. Unzip the downloaded file and launch GSEA by double-clicking on gsea.bat if you are on Windows, by double-clicking on the App or on the gsea.command if you are on macOS, or by typing the command ./gsea.sh if you are on Linux or on macOS.

In File>Preferences, specify the directoty where you want the GSEA results to be saved. By default it is in your home repository /gsea_home/output.

Click on Load Data and upload the different input files. No error must be reported if your formats are correct.

4.2.A. If you are using the **method with unranked values**:¶

  • Click on Run GSEA

  • Select the correct in Required fields the correct filed and parameters:

    • gsea_T1D_norm.quant.txt for the expression dataset
    • choose the gene set database of interest
    • start with keeping 1000 permutations: if you have pvalues displayed at 0, then rerun the analysis increasing the number of permutations
    • for phenotype labels, select one of the .cls file and the contrast of interest
    • keep the option collapse to deal with several probes for the same gene
    • choose phenotype for permutation if you have several samples per category (7 are recommanded), otherwise select gene_set but the statistical test will not be able to account for putative correlations between genes.
    • select here Human_ILLUMINA_Array_MSigDB.v2022.1.Hs.chip since we are working with Illumina array data and we kept the probeID identifiers. Should you be working with unique gene symbols, select the HGNC platform.

  • In Basic fields enter a useful name for your analysis and select a method to rank genes. You can also specify the output folder name.

  • Click on Run at the bottom.

  • You can then run a Leading edge analysis and vizualize enrichment map. The enrichment outputs can also be imported to Cytoscape using the Cytoscape app EnrichmentMap.

4.2.B. If you are using the **method with pre-ranked values**:¶

  • Click on Run GSEAPreranked

  • Select in Required fileds the correct files and parameters:

    • choose the gene set database of interest
    • start with keeping 1000 permutations: if you have pvalues displayed at 0, then rerun the analysis by increasing the number of permutations
    • for the ranked list, select for example gsea_T1D_PatVsCont.rbk
    • keep the option collapse to deal with several probes for the same gene
    • select here Human_ILLUMINA_Array_MSigDB.v2022.1.Hs.chip since we are working with Illumina array data and we kept the probeID identifiers. Should you be working with unique gene symbols, select the HGNC platform.

  • In Basic fields enter a useful name for your analysis and select a method to rank genes. You can also specify the output folder name.
Well done: If your analysis was succesfull, you see Success in green in the "GSEA Reports" frame on the left panel. Just click on Success to look at the results. Follow the guide to understand the outputs. A FDR of 25% is the minimum required to consider a significant enrichment.

Note: GSEA in R:

  • an R package exist to run GSEA: GSEABase
  • the package clusterProfiler implements GSEA (see part 3.2 of this tutorial) and even a fastest approach (fgsea).


Success: Well done! You now know how to draw annotated heatmaps and perform enrichment analyses, either ORAs or GSEAs.
What can you try by yourself? Here are some ideas:
- try to redo the analyses with other DGE outputs: above we used either the full model or one contrast only.
- try other cutoffs for significant genes and redo the analyses
- try other parameters for the clustering: in the heatmaps, or in GSEA Java application.
- try other packages for clustering, heatmaps or enrichment analyses
- try other clustering methods like kmeans
- try other integrative approaches like WGCNA

[version 13/11/2022 - last revision:@SCaburet]